Links to the material and slides for this course can be found on GitHub, or they can be downloaded as a zip archive from here.
Once the zip file is unarchived, all presentations will be available as HTML slides and pages in the directories underneath.
The full image of the workflow is here: overview
scRNAseq is a very variable process and each dataset has unique QC problems. Though a simple workflow often works, we often need many more tools in our toolbox to handle more complex data sets. Deciding what is appropriate often depends on the data set and its QC metrics.
Cell Ranger is a suite of tools for single cell processing and analysis available from 10x Genomics. It performs key processing steps, e.g. mapping, and is the first chance to delve into your dataset's QC.
In this session we will give a brief overview of running Cell Ranger, then dive deeper into interpreting its outputs.
Often genomics centers will run it for you (e.g. here at Rockefeller).
Why run Cell Ranger:
Cell Ranger is available from the 10x Genomics website.
Also available are pre-baked references for Human and Mouse genomes (GRCh37/38 and GRCm37)
This will all be run in a terminal on the server you are using.
wget -O cellranger-8.0.0.tar.gz "https://cf.10xgenomics.com/releases/cell-exp/cellranger-8.0.0.tar.gz?Expires=1711772964&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA&Signature=muvzcbqxba6d-blyYS02MVfLlzwZk6iZNQWXdaoCLnl7owW2nEN-IHwSPwdNoYl-6Xia7rr0S1sLCUQTsekGm2pQKcd0kqK~ndHK0DM7SwSVpXLlRvBV5pXt~EIlsxATVBKVeQLnUy698N-WnRlT~ahjlU-nMdpomX9-lOkF~w8gbgHBdtPXunTWfW87sSJLpHMDVENSF7TFJsXERDwDnsXyQLCuEhfGTCOnupkaATlLEr9kaeCStePKkwGyqgi1m8Ua02NNGHWPIJ6I1mDt695wo~dgptpJF4SDNRTyE-TuXrHfIqRjZB60zhWRJczFo2kpL7FCKwliE-vJ6djcSw__"
Download reference for Human genome (GRCh38)
wget "https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2024-A.tar.gz"
Unpack software and references.
tar -xzvf cellranger-8.0.0.tar.gz
tar -xzvf refdata-gex-GRCh38-2024-A.tar.gz
export PATH=/PATH_TO_CELLRANGER_DIRECTORY/cellranger-8.0.0:$PATH
Now that we have downloaded the Cell Ranger software and the required pre-built reference for Human (GRCh38), we can start generating count data from scRNA-seq/snRNA-seq FASTQ data.
Typically, FASTQ files for your scRNA-seq run will have been generated using the Cell Ranger mkfastq tool to produce a directory of FASTQ files.
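Before running the count step it can help to see which samples a FASTQ directory actually holds. The sketch below assumes the usual bcl2fastq/mkfastq file naming (Sample_S1_L001_R1_001.fastq.gz); the function name and the demo file names are hypothetical.

```shell
# List the sample prefixes found in a FASTQ directory.
# Assumes the bcl2fastq/mkfastq naming scheme:
#   <Sample>_S<n>_L<lane>_<R1|R2|I1|I2>_001.fastq.gz
list_fastq_samples() {
  ls "$1" | sed -n -E 's/_S[0-9]+_L[0-9]+_[RI][0-9]_001\.fastq\.gz$//p' | sort -u
}

# Demo on a throwaway directory of empty, hypothetical files.
demo_dir=$(mktemp -d)
touch "$demo_dir/pbmc_S1_L001_R1_001.fastq.gz" \
      "$demo_dir/pbmc_S1_L001_R2_001.fastq.gz" \
      "$demo_dir/liver_S2_L001_R1_001.fastq.gz" \
      "$demo_dir/liver_S2_L001_R2_001.fastq.gz"
list_fastq_samples "$demo_dir"
```

If a directory holds more than one sample, the prefix of interest can be passed to cellranger count with its --sample argument.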
We can now use the Cell Ranger count command with our reference and FASTQ files to generate our count matrix and associated files.
If you are analyzing single nuclei RNA-seq data, remember to check the --include-introns flag. Since Cell Ranger 7.0 intronic reads are counted by default (--include-introns=true).
cellranger count --id=my_run_name \
--fastqs=PATH_TO_FASTQ_DIRECTORY \
--transcriptome=/PATH_TO_CELLRANGER_DIRECTORY/refdata-gex-GRCh38-2024-A \
--create-bam=true
If you are working with a genome which is neither human nor mouse, you will need to find another source for your Cell Ranger reference.
Luckily, many references are pre-built by other consortia.
We can also build our own references with the mkgtf and mkref tools in Cell Ranger.
For this you just need a FASTA file (DNA sequence) and a GTF file (gene annotation) for your reference.
Importantly, for Cell Ranger count only features labelled as exon in your GTF (column 3) will be considered when counting signal in genes.
Many genomes label mitochondrial genes with CDS and not exon, so these must be updated.
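One way to make that update is a small awk filter over the GTF. This is a minimal sketch rather than a complete recipe: it assumes the mitochondrial contig is named MT and that its genes have no exon records already (otherwise exons would be duplicated). The function name and the demo annotation are made up for illustration.

```shell
# Relabel CDS records as exon on the mitochondrial contig of a GTF.
# Assumptions: the contig is named "MT" (pass another name as the 2nd argument)
# and its genes have no exon records of their own.
relabel_mt_cds() {
  awk -v mt="${2:-MT}" 'BEGIN { FS = OFS = "\t" }
    /^#/ { print; next }                      # keep header comments untouched
    $1 == mt && $3 == "CDS" { $3 = "exon" }   # column 3 holds the feature type
    { print }' "$1"
}

# Demo on a tiny, made-up GTF.
demo_gtf=$(mktemp)
printf '1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n' > "$demo_gtf"
printf 'MT\tsrc\tCDS\t300\t400\t.\t+\t0\tgene_id "g2"; transcript_id "t2";\n' >> "$demo_gtf"
relabel_mt_cds "$demo_gtf"
```

The relabelled annotation could then be redirected to a new file and supplied to cellranger mkref through its --genes argument, alongside --genome and --fasta.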
Having completed the Cell Ranger count step, you will have created a folder with the name set by the --id flag of the count command.
Within this folder will be the outs/ directory containing all the outputs generated from Cell Ranger count.
The count matrices to be used for further analysis are stored in both MEX and HDF5 formats within the output directories.
The filtered matrix only contains detected, cell-associated barcodes whereas the raw contains all barcodes (background and cell-associated).
MEX format
* filtered_feature_bc_matrix
* raw_feature_bc_matrix

HDF5 format
* filtered_feature_bc_matrix.h5
* raw_feature_bc_matrix.h5
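A quick sanity check on a MEX directory is to read the matrix dimensions from the header of matrix.mtx.gz. The sketch below relies on the Matrix Market convention that the first non-comment line holds rows, columns and non-zero entries; the function name and the toy matrix are illustrative only.

```shell
# Print features, barcodes and non-zero entries for a MEX directory.
# The first non-comment line of a Matrix Market file is "rows cols entries";
# in Cell Ranger output rows are features and columns are barcodes.
mex_dims() {
  gzip -dc "$1/matrix.mtx.gz" |
    awk '!/^%/ { printf "features=%s barcodes=%s entries=%s\n", $1, $2, $3; exit }'
}

# Demo on a tiny, made-up matrix (3 features x 2 barcodes, 4 non-zero counts).
demo_mex=$(mktemp -d)
printf '%%%%MatrixMarket matrix coordinate integer general\n3 2 4\n1 1 5\n2 1 1\n3 2 7\n1 2 2\n' |
  gzip -c > "$demo_mex/matrix.mtx.gz"
mex_dims "$demo_mex"
```

Running the same function on the filtered and raw directories shows how many barcodes were kept as cell-associated.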
The outs directory may also contain a BAM file of alignments for all barcodes against the reference (possorted_genome_bam.bam) as well as an associated BAI index file (possorted_genome_bam.bam.bai). This is no longer a default output.
This BAM file is sometimes used in downstream analysis such as scSplit/Velocyto as well as for the generation of signal graphs such as bigWigs.
Cell Ranger also outputs a file, cloupe.cloupe, for visualization within the [10x Loupe Browser software](https://www.10xgenomics.com/support/software/loupe-browser/latest).
This allows for the visualization of scRNA-seq/snRNA-seq data as a t-SNE/UMAP, with the ability to overlay QC metrics and gene expression onto the cells in real time.
Assessing the overall quality of a scRNA-seq/snRNA-seq experiment after Cell Ranger gives us our first chance to dig into the quality of the dataset and gain insight into any issues we might face in data analysis.
Cell Ranger will also output summaries of useful metrics as a text file (metrics_summary.csv) and as an intuitive web page.
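The text summary can be read directly in the terminal. The sketch below assumes the file has one header row and one value row, with large numbers quoted because they contain thousands separators. The function name and the cut-down demo file are hypothetical, and the demo values are invented.

```shell
# Print each metric in metrics_summary.csv as "name<TAB>value".
# The file has one header row and one value row; values such as "8,421"
# are quoted, so the splitter has to respect double quotes.
show_metrics() {
  awk '
    function split_csv(line, out,    n, i, c, inq, field) {
      n = 0; field = ""; inq = 0
      for (i = 1; i <= length(line); i++) {
        c = substr(line, i, 1)
        if (c == "\"") inq = !inq                     # toggle quoted state
        else if (c == "," && !inq) { out[++n] = field; field = "" }
        else field = field c
      }
      out[++n] = field
      return n
    }
    NR == 1 { n = split_csv($0, name) }
    NR == 2 { split_csv($0, value)
              for (i = 1; i <= n; i++) printf "%s\t%s\n", name[i], value[i] }
  ' "$1"
}

# Demo on a cut-down, made-up summary (a real file has many more columns).
demo_csv=$(mktemp)
printf 'Estimated Number of Cells,Mean Reads per Cell,Sequencing Saturation\n' > "$demo_csv"
printf '"8,421","35,102",61.3%%\n' >> "$demo_csv"
show_metrics "$demo_csv"
```

Pointing show_metrics at outs/metrics_summary.csv for several runs makes it easy to compare samples side by side.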
Metrics include
There are many potential issues which can arise in scRNA-seq/snRNA-seq data including -
The web summary html file contains an interactive report describing the most essential QC for your single cell experiment as well as initial clustering and dimension reduction for your data.
The web summary also contains useful information on the input files and the versions used in this analysis for later reproducibility.
The first thing we can review is the Sample information panel (Bottom Right). As most people do not run Cell Ranger themselves (often genomics cores will run it for you), it is important to check that this panel matches expectations.
* Sample ID - Sample name (assigned in cellranger count).
* Chemistry - The 10x chemistry used.
* Include introns - Whether counting was run to include intron counts (typical for single nuclei RNA-seq).
* Reference Path and Transcriptome - References used in analysis.
* Pipeline Version - Version of Cell Ranger used.
The Sequencing panel highlights information on the quality of the Illumina sequencing (Top Left).
Key Metrics we look for:
* Q30 Bases in RNA Read > 65% (usually > 80%)
+ Reflects the sequencing quality
+ Need to check with sequencing service supplier
* Sequencing Saturation > 40% (usually range 20% ~ 80%)
+ Reflects the complexity of libraries
+ Consider reconstructing library if too low
The Mapping panel highlights information on the mapping of reads to the reference genome and transcriptome (Bottom Left).
Key Metrics we look for:
* Mapped to Genome > 60% (usually range 50% ~ 90%)
+ Mapping rate to reference genome
+ Check reference genome version if too low
The Cells panel highlights some of the most important information in the report: the total number of cells captured and the distribution of counts across cells and genes (Top Right).
Key Metrics we look for:
* Fraction Reads in Cells > 70% (usually > 85%)
+ Reflects the ambient RNA contamination
+ Consider correcting for ambient RNA if < 90%
* Median reads per cell > 20,000 and estimated number of cells 500 ~ 10,000
+ Values outside the normal range may be caused by a failure of cell identification
+ Need to check knee plot and re-evaluate cell number
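The rule-of-thumb minimums from the Sequencing, Mapping and Cells panels can be checked mechanically once the values are to hand. This is a rough sketch: check_min is a hypothetical helper, the metric labels are passed in as plain text, and the values shown are invented for illustration.

```shell
# Flag a metric that does not exceed its minimum.
# "%" signs and thousands separators are stripped before comparing.
check_min() {  # usage: check_min "Metric name" value minimum
  awk -v name="$1" -v raw="$2" -v min="$3" 'BEGIN {
    value = raw; gsub(/[%,]/, "", value)
    status = (value + 0 > min + 0) ? "OK" : "CHECK"
    printf "%s\t%s\t%s\n", status, name, raw
  }'
}

# Invented example values against the minimums quoted in the panels.
check_min "Q30 Bases in RNA Read"   "91.2%"  65
check_min "Sequencing Saturation"   "32.5%"  40
check_min "Mapped to Genome"        "94.0%"  60
check_min "Fraction Reads in Cells" "88.7%"  70
check_min "Reads per Cell"          "35,102" 20000
```

A CHECK line is only a prompt to look more closely at that panel, not a verdict on the sample.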
The Cells panel also includes an interactive knee plot.
The knee plot shows:
On the x-axis, the barcodes ordered from the most frequent on the left to the least frequent on the right.
On the y-axis, the frequency of each ordered barcode.
Highlighted in dark blue are the barcodes marked as associated with cells.
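The two axes of the knee plot are simply barcode rank and barcode count. As a minimal sketch, given a two-column table of barcode and total UMI count (a hypothetical export, since Cell Ranger does not write this table itself), the ranks can be produced with sort and awk:

```shell
# Rank barcodes by total UMI count, highest first: the x and y values of a knee plot.
# Input: two tab-separated columns, barcode and UMI count (a hypothetical export).
# Output: rank, UMI count, barcode.
knee_ranks() {
  sort -t "$(printf '\t')" -k2,2nr "$1" |
    awk 'BEGIN { FS = OFS = "\t" } { print NR, $2, $1 }'
}

# Demo with made-up barcodes and counts.
demo_counts=$(mktemp)
printf 'AAAC\t12\nAAAG\t9500\nAACT\t3\nAAGT\t7200\nACGT\t150\n' > "$demo_counts"
knee_ranks "$demo_counts"
```

Plotting the first column against the second on log-log axes gives the familiar cliff-and-knee shape.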
It is apparent that barcodes labelled blue (cell-associated barcodes) do not have a cut-off just based on the UMI count.
In the latest version of Cell Ranger a two-step process is used to define cell-associated barcodes, based on the EmptyDrops method (Lun et al., 2019).
If required, the --force-cells flag can be used with cellranger count to identify a set number of cell-associated barcodes.
It is important to know what version and parameters were used to run Cell Ranger. This cell calling step is continually updated and it can have a dramatic effect on your results. (Cell Ranger v4.0 was released just last month, and there has again been a change in default parameters.)
The knee plot also acts as a good QC tool to investigate differing types of single cell failure.
Whereas our previous knee plot represented a good sample, differing knee plot patterns can be indicative of specific problems with the single cell protocol. We will show you some examples of these below from real data.
In this example we see no specific cliff and knee suggesting a failure in the integration of oil, beads and samples (wetting failure) or a compromised sample.
If there is a clog in the machine we may see a knee plot where the overall number of samples is low.
There may be occasions where we see two sets of cliff-and-knees in our knee plot.
This could be indicative of a heterogeneous sample where we have two populations of cells with differing overall RNA levels.
Knee plots should be interpreted in the context of the biology under investigation.
The web summary also contains an analysis page where default dimension reduction, clustering and differential expression between clusters have been performed.
Additionally, the analysis page contains information on sequencing saturation and genes per cell vs reads per cell.
The t-SNE plot shows the distribution and similarity within your data.
The Sequencing Saturation and Median Genes per Cell plots show these calculations (as shown on the summary page) over successive downsampling of the data.
By reviewing the curve of the downsampled metrics we can assess whether we are approaching saturation for either of these metrics.
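As a worked example of what the saturation curve measures, sequencing saturation can be written as 1 - (deduplicated reads / total reads), where deduplicated reads are unique barcode, UMI and gene combinations. The helper name and the read counts below are invented purely to illustrate the arithmetic.

```shell
# Sequencing saturation = 1 - (deduplicated reads / total reads),
# where deduplicated reads are unique (barcode, UMI, gene) combinations.
saturation() {  # usage: saturation n_deduped_reads n_reads
  awk -v d="$1" -v n="$2" 'BEGIN { printf "%.1f%%\n", 100 * (1 - d / n) }'
}

# Made-up counts at increasing sequencing depth for the same library.
saturation 9000000   10000000   # 10.0%
saturation 30000000  50000000   # 40.0%
saturation 40000000 100000000   # 60.0%
```

In this invented series each extra read adds fewer new molecules as depth grows, which is the flattening we look for in the downsampling curve.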
Could show examples of high MT specifics of non-specific clustering or MT clustering.
With luck there are no QC issues. Often at this step we would not make any decisions unless there is a clear, complete failure. It is still an important first step in setting expectations and preparing for what you may need to do for the dataset: QC, custom read-ins and dealing with issues.